Smoothing the subtrans performance catastrophe

  • simon.riggs@enterprisedb.com - 2022-08-01T16:42:49+00:00
    "A mathematical catastrophe is a point in a model of an input-output system, where a vanishingly small change in the input can produce a large change in the output." We have just such a change in Postgres: when a snapshot overflows. In this case it takes only one subxid over the subxid cache limit to slow down every request in XidInMVCCSnapshot(), which becomes painful when a long running transaction exists at the same time. This situation has been noted by various bloggers, but is illustrated clearly in the attached diagram, generated by test results from Julien Tachoires. The reason for the slowdown is clear: when we overflow we check every xid against subtrans, producing a large stream of lookups. Some previous hackers have tried to speed up subtrans - this patch takes a different approach: remove as many subtrans lookups as possible. (So is not competing with those other solutions). Attached patch improves on the situation, as also shown in the attached diagram. The patch does these things: 1. Rework XidInMVCCSnapshot() so that it always checks the snapshot first, before attempting to lookup subtrans. A related change means that we always keep full subxid info in the snapshot, even if one of the backends has overflowed. 2. Use binary search for standby snapshots, since the snapshot subxip is in sorted order. 3. Rework GetTopmostTransaction so that it a) checks xmin as it goes, b) only does one iteration on standby snapshots, both of which save subtrans lookups in appropriate cases. (This was newly added in v6) Now, is this a panacea? Not at all. What this patch does is smooth out the catastrophic effect so that a few overflowed subxids don't spoil everybody else's performance, but eventually, if many or all sessions have their overflowed subxid caches then the performance will descend as before, albeit that the attached patch has some additional optimizations (2, 3 above). So what this gives is a better flight envelope in case of a small number of occasional overflows. Please review. Thank you. -- Simon Riggs http://www.EnterpriseDB.com/
    • andres@anarazel.de - 2022-08-03T19:18:26+00:00
      Hi,

      On 2022-08-01 17:42:49 +0100, Simon Riggs wrote:
      > The reason for the slowdown is clear: when we overflow we check every xid against subtrans, producing a large stream of lookups. Some previous hackers have tried to speed up subtrans - this patch takes a different approach: remove as many subtrans lookups as possible. (So is not competing with those other solutions).
      >
      > Attached patch improves on the situation, as also shown in the attached diagram.

      I think we should consider redesigning subtrans more substantially - even with the changes you propose here, there are still plenty of ways to hit really bad performance. And there's only so much we can do about that without more fundamental design changes.

      One way to fix a lot of the issues around pg_subtrans would be to remove the pg_subtrans SLRU and replace it with a purely in-memory hashtable. IMO there's really no good reason to use an SLRU for it (anymore).

      In contrast to e.g. clog or multixact we don't need to access a lot of old entries, and we don't need persistency etc. Nor is it a good use of memory and IO to have loads of pg_subtrans pages that don't point anywhere, because the xid is just a "normal" xid.

      While we can't put a useful hard cap on the number of potential subtrans entries (we can only throw subxid->parent mappings away once no existing snapshot might need them), saying that there can't be more subxids "considered running" at a time than can fit in memory doesn't seem like a particularly problematic restriction.

      So, why don't we use a dshash table with some amount of statically allocated memory for the mapping? In common cases that will *reduce* memory usage (because we don't need to reserve space for [as many] subxids in snapshots / procarray anymore) and IO (no mostly-zeroes pg_subtrans).

      Greetings,

      Andres Freund
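      For concreteness, a small hypothetical sketch of the subxid->parent mapping described above, keyed on the child xid. The SubXidMapEntry struct, the two helper functions, and the assumption that subxid_map has already been created in shared memory are all invented for illustration; only dshash_find(), dshash_find_or_insert() and dshash_release_lock() are the existing dshash API.

      ```c
      #include "postgres.h"

      #include "access/transam.h"
      #include "lib/dshash.h"

      /* One 8-byte entry per subxid: child xid (the key) and its parent. */
      typedef struct SubXidMapEntry
      {
          TransactionId xid;      /* subtransaction xid, used as the hash key */
          TransactionId parent;   /* immediate parent xid */
      } SubXidMapEntry;

      /* Assumed: attached to a dshash table created in shared memory at startup. */
      static dshash_table *subxid_map = NULL;

      /* Record (or update) the parent of a subtransaction. */
      static void
      SubXidMapSetParent(TransactionId xid, TransactionId parent)
      {
          bool        found;
          SubXidMapEntry *entry;

          entry = dshash_find_or_insert(subxid_map, &xid, &found);
          entry->parent = parent;
          dshash_release_lock(subxid_map, entry);
      }

      /* Look up the parent, or return InvalidTransactionId if no mapping exists. */
      static TransactionId
      SubXidMapGetParent(TransactionId xid)
      {
          SubXidMapEntry *entry;
          TransactionId parent = InvalidTransactionId;

          entry = dshash_find(subxid_map, &xid, false);
          if (entry != NULL)
          {
              parent = entry->parent;
              dshash_release_lock(subxid_map, entry);
          }
          return parent;
      }
      ```

      Each entry is just two TransactionIds (8 bytes), which is where the memory estimate discussed downthread comes from; creating the table would additionally need a dsa_area and a dshash_parameters setup, omitted here.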
      • simon.riggs@enterprisedb.com - 2022-08-04T12:11:25+00:00
        On Wed, 3 Aug 2022 at 20:18, Andres Freund <andres@anarazel.de> wrote:
        > On 2022-08-01 17:42:49 +0100, Simon Riggs wrote:
        > > The reason for the slowdown is clear: when we overflow we check every xid against subtrans, producing a large stream of lookups. Some previous hackers have tried to speed up subtrans - this patch takes a different approach: remove as many subtrans lookups as possible. (So is not competing with those other solutions).
        > >
        > > Attached patch improves on the situation, as also shown in the attached diagram.
        >
        > I think we should consider redesigning subtrans more substantially - even with the changes you propose here, there's still plenty ways to hit really bad performance. And there's only so much we can do about that without more fundamental design changes.

        I completely agree - you will be glad to hear that I've been working on a redesign of the subtrans module. But we should be clear that redesigning subtrans has nothing to do with this patch; they are separate ideas, and this patch relates to XidInMVCCSnapshot(), an important caller of subtrans. I will post my patch, when complete, in a different thread.

        > One way to fix a lot of the issues around pg_subtrans would be remove the pg_subtrans SLRU and replace it with a purely in-memory hashtable.
        >
        > In contrast to e.g. clog or multixact we don't need to access a lot of old entries, we don't need persistency etc. Nor is it a good use of memory and IO to have loads of pg_subtrans pages that don't point anywhere, because the xid is just a "normal" xid.
        >
        > While we can't put a useful hard cap on the number of potential subtrans entries (we can only throw subxid->parent mappings away once no existing snapshot might need them), saying that there can't be more subxids "considered running" at a time than can fit in memory doesn't seem like a particularly problematic restriction.

        I do agree that sometimes it is easier to impose restrictions than to try to provide unbounded resources. Having said that, I can't see an easy way of making that work well in practice for this case. Making write transactions just suddenly stop working at some point doesn't sound like it would be good for availability, especially when it happens sporadically and unpredictably, as it would whenever long-running transactions appear alongside users of subtransactions.

        > So, why don't we use a dshash table with some amount of statically allocated memory for the mapping? In common cases that will *reduce* memory usage (because we don't need to reserve space for [as many] subxids in snapshots / procarray anymore) and IO (no mostly-zeroes pg_subtrans).

        I considered this and have ruled it out, but as I said above, we can discuss that on a different thread.

        --
        Simon Riggs
        http://www.EnterpriseDB.com/
      • robertmhaas@gmail.com - 2022-08-03T19:36:40+00:00
        On Wed, Aug 3, 2022 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
        > In contrast to e.g. clog or multixact we don't need to access a lot of old
        > While we can't put a useful hard cap on the number of potential subtrans entries (we can only throw subxid->parent mappings away once no existing snapshot might need them), saying that there can't be more subxids "considered running" at a time than can fit in memory doesn't seem like a particularly problematic restriction.

        That sounds really problematic to me, unless I misunderstand what you're proposing. Say I have a plpgsql function containing a FOR loop which in turn contains an EXCEPTION block which in turn does DML. Right now, that loop could iterate millions of times and everything would still work. Sure, there might be performance impacts depending on what else is happening on the system, but it might also be totally fine. IIUC, you'd like to make that case fail outright. I think that's a non-starter.

        I don't know whether Simon's ideas here are amazingly good, utterly terrible, or something in between, but I think we can evaluate the patch actually submitted rather than propose a complete redesign of the entire mechanism - especially one that seems like it would break stuff that currently works.

        --
        Robert Haas
        EDB: http://www.enterprisedb.com
        • andres@anarazel.de - 2022-08-03T20:14:37+00:00
          Hi,

          On 2022-08-03 15:36:40 -0400, Robert Haas wrote:
          > On Wed, Aug 3, 2022 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
          > > In contrast to e.g. clog or multixact we don't need to access a lot of old
          > > While we can't put a useful hard cap on the number of potential subtrans entries (we can only throw subxid->parent mappings away once no existing snapshot might need them), saying that there can't be more subxids "considered running" at a time than can fit in memory doesn't seem like a particularly problematic restriction.
          >
          > That sounds really problematic to me, unless I misunderstand what you're proposing. Say I have a plpgsql containing a FOR loop which in turn contains an EXCEPTION block which in turn does DML. Right now, that loop could iterate millions of times and everything would still work. Sure, there might be performance impacts depending on what else is happening on the system, but it might also be totally fine. IIUC, you'd like to make that case fail outright. I think that's a non-starter.

          I don't think this scenario would fundamentally change - we already keep the set of subxids in backend-local memory (e.g. either a dedicated TransactionStateData or an element in ->childXids) and in the locking table (via XactLockTableInsert()).

          The problematic part isn't keeping "actually" running subxids in memory, but keeping subxids that might be "considered running" in memory (i.e. subxids that are considered running by an old snapshot in another backend).

          A hashtable just containing the child->parent mapping for subxids doesn't actually need that much memory. It'd be approximately (2 * 4 bytes) * subxids * (2 - fillfactor) or such, so maybe ~10MB for 1 million subxids. Allocating that on demand doesn't strike me as prohibitive.

          > I don't know whether Simon's ideas here are amazingly good, utterly terrible, or something in between, but I think we can evaluate the patch actually submitted rather than propose a complete redesign of the entire mechanism - especially one that seems like it would break stuff that currently works.

          We've had quite a few patches that try to address issues around subxids, but only ever improve things on the margins. I'm doubtful that's a useful use of time.

          Greetings,

          Andres Freund
          • robertmhaas@gmail.com - 2022-08-04T13:17:23+00:00
            On Wed, Aug 3, 2022 at 4:14 PM Andres Freund <andres@anarazel.de> wrote:
            > I don't think this scenario would fundamentally change - we already keep the set of subxids in backend local memory (e.g. either a dedicated TransactionStateData or an element in ->childXids) and in the locking table (via XactLockTableInsert()).

            Sure....

            > The problematic part isn't keeping "actually" running subxids in memory, but keeping subxids that might be "considered running" in memory (i.e. subxids that are considered running by an old snapshot in another backend).
            >
            > A hashtable just containing child->parent mapping for subxids doesn't actually need that much memory. It'd be approximately (2 * 4 bytes) * subxids * (2-fillfactor) or such? So maybe ~10MB for 1 million subxids? Allocating that on-demand doesn't strike me as prohibitive.

            I mean the worst case is ~2 bn, no?

            --
            Robert Haas
            EDB: http://www.enterprisedb.com
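            As a back-of-the-envelope illustration (not from the thread) of how the estimate quoted above scales: with 8-byte entries and an assumed 0.75 fill factor, 1 million subxids come to roughly 10 MB, while a worst case on the order of 2 billion subxids "considered running" would be roughly 19 GB.

            ```c
            #include <stdio.h>

            /* Rough size of the proposed mapping: entry bytes * n * (2 - fillfactor). */
            static double
            subxid_map_bytes(double n_subxids, double fillfactor)
            {
                const double entry_size = 2 * 4;    /* two 4-byte TransactionIds */

                return entry_size * n_subxids * (2.0 - fillfactor);
            }

            int
            main(void)
            {
                /* ~9.5 MB: the "maybe ~10MB for 1 million subxids" case. */
                printf("1e6 subxids: %.1f MB\n",
                       subxid_map_bytes(1e6, 0.75) / (1024.0 * 1024));

                /* ~18.6 GB: the ~2 billion worst case asked about here. */
                printf("2e9 subxids: %.1f GB\n",
                       subxid_map_bytes(2e9, 0.75) / (1024.0 * 1024 * 1024));
                return 0;
            }
            ```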
    • dilipbalaut@gmail.com - 2022-08-03T11:55:42+00:00
      On Mon, Aug 1, 2022 at 10:13 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
      >
      > "A mathematical catastrophe is a point in a model of an input-output system, where a vanishingly small change in the input can produce a large change in the output."
      >
      > We have just such a change in Postgres: when a snapshot overflows. In this case it takes only one subxid over the subxid cache limit to slow down every request in XidInMVCCSnapshot(), which becomes painful when a long running transaction exists at the same time. This situation has been noted by various bloggers, but is illustrated clearly in the attached diagram, generated by test results from Julien Tachoires.
      >
      > The reason for the slowdown is clear: when we overflow we check every xid against subtrans, producing a large stream of lookups. Some previous hackers have tried to speed up subtrans - this patch takes a different approach: remove as many subtrans lookups as possible. (So is not competing with those other solutions).
      >
      > Attached patch improves on the situation, as also shown in the attached diagram.
      >
      > The patch does these things:
      >
      > 1. Rework XidInMVCCSnapshot() so that it always checks the snapshot first, before attempting to lookup subtrans. A related change means that we always keep full subxid info in the snapshot, even if one of the backends has overflowed.
      >
      > 2. Use binary search for standby snapshots, since the snapshot subxip is in sorted order.
      >
      > 3. Rework GetTopmostTransaction so that it a) checks xmin as it goes, b) only does one iteration on standby snapshots, both of which save subtrans lookups in appropriate cases. (This was newly added in v6)
      >
      > Now, is this a panacea? Not at all. What this patch does is smooth out the catastrophic effect so that a few overflowed subxids don't spoil everybody else's performance, but eventually, if many or all sessions have their overflowed subxid caches then the performance will descend as before, albeit that the attached patch has some additional optimizations (2, 3 above). So what this gives is a better flight envelope in case of a small number of occasional overflows.
      >
      > Please review. Thank you.

      +1. I had a quick look into the patch to understand the idea, and I think it looks really promising.

      --
      Regards,
      Dilip Kumar
      EnterpriseDB: http://www.enterprisedb.com